Prosper Loan Data Exploration by Yanhua He
========================================================
Referring to Wikipedia, prosper has a transaction-based business model, in which the company collects revenue by taking a fee on its customers’ transactions. Borrowers who receive a loan pay an origination fee of 1.00% to 5.00% depending on the borrower’s Prosper Rating, and investors pay a 1% annual servicing fee.Firstly, let’s take a look at the dataset.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
The dataset contains information of loans from the inception of Prosper.com in 2005 till the 2014 March. The questions I want to explore are:
1.Who are the borrowers and what can I learn from them?
2.How does Prosper set the interest rate for borrowers?
3.How is the performance of loan, such as the expected return and the risk of defaulted loans?
Univariate Plots Section
Customers profile
What is the # of borrowers by each state?

From the graph, we can see that there are much more borrowers in CA than in than other states. Considering that Prosper is San Francisco based company, it makes sense that it might be more popular in its homebase. Also, prosper had a large number(>5000) of customer base in FL,GA,IL,NY and TX. According to wikipedia, states (Iowa, Maine, and North Dakota) not permitted to borrow throughProsper, that is way, there is no data for those states.
What are the occupations of borrowers?

We can see that borrowers are from all kinds of professions. Also, there are many people chose “other” or “professional” not specifyting the fields of their occupation.
What are the income ranges for prosper borrowers?

The bar chart presents the amount of borrowers’ by income range groups.The verified income for most borrowers range from $25,000 to $74,999.
How is their debt to income ratio?

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
It’s a fairly symetric distribution, with its midpoint at 0.22 and the majority of borrowers have debt to income ratio less than 0.5.However, we could see there is a tail with a few borrowers having >0.75 DebtToIncomeRatio.
CreditScore of customers

Plot a histogram to show the distribution of the CreditScore of prosper borrowers.We can see from the plot, there are many whitespace between the CreditScore values, most borrowers have credit score around 700.Also, there are a few borrowers with score lower than 600 and I am quite curious about what rate they can get from Prosper.
Delinquencies History

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
It is very positively skewed that the majority of borrowers don’t have any delinquencies. And there are some customers who have more than 10 delinquencies, but still could get loan from Prosper.
ProsperRating

ProsperRating is applicable for loans originated after July 2009 in forms of alpha and numeric, where AA(or 10) is the lowest risk down to HR(or 1) which stands for high risk.Take a look at how the amount of customers distribute in different ProsperRatings. A middle rating C is mostly common for customers, and only a small fraction of customers are rated AA.
ProsperScore

ProsperScore is a custom risk score and is applicable for loans originated after July 2009. Most customers have scores between 4 and 8.
How many loans originated by each quarter?

We can see the amount of loans dropped significantly from 2008 Q4, it seems that Prosper.com had lost a lot of its prosperity.According to Wikipedia, in 2008 Q4, a class action lawsuit was filed against Prosper alleging its violation of the California and federal secrities laws.After its reopen in 2009, prosper loans have continued to grow and regain its popularity.
Loan Terms

From the plot we can conclude that before 2010, there are only 36-month loans, starting from late 2010, there are 3 terms available, the most popular one is 36-month loan, and then it’s 60-month loan, 12-month term is rarely seen. Currently, it seems that 12-month term is no longer available.
How much do people usually apply for each loan ?

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The boxplot shows the loan amount applied by customers with different ratings. Most of the loans applied by HR, E rated customers are less than $5000. It seems that you need to have a rating equal or higher than B to be eligible to apply loans>$25000. This to some extent explains why larger loans get better rates because they are applied by people with better prosper ratings.Referring to a post:“Prosper’s approach is to have a sliding scale of the maximum borrower amount allowed.
How much do borrowers pay on a monthly basis?

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
The above histogram shows the distribution of monthly payment, it is left skewed with the bulk of payment per month is under $400, the median monthly payment is around $217 and the most common payment per month is $150 dollors.
Number of Prosper loans the borrower had at the time when creating the listing

It shows more than 80000 borrowers don’t have prior loans, which means most of borrowers are first timers. Also, there are more that 15,000 listings of which the borrowers had one loan before from prosper.
The status of the loan

##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
##
## Cancelled Chargedoff Completed
## 0.0000438839 0.1052511476 0.3341671274
## Current Defaulted FinalPaymentInProgress
## 0.4965551138 0.0440418828 0.0017992399
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 0.0001404285 0.0070740848 0.0023258467
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 0.0031859712 0.0027471322 0.0026681412
33.4% of the loans are completed, 49.7% of the loans are still current, and there is so far 4.4% defaulted rate and 10.5% chargedoff rate.
How is the loan status over time?


The Chargedoff percentage for loans orginated between 2006 and 2008 was very high, most times was even higher than 30%. Then the percentage dropped dramatically in 2009, continued to fluctuate around 0.15~0.2, and starting from 2011 Q3, the percentage was continuously dropping, which means that Prosper has improved their risk management hugely. One reminder is that some of the loans originated from 2011 are still current when the data was pulled, and most of loans originated during 2013&2014 are still current.
Loans that are chargedoff by credit rating before & after 2009

From the plot, we can see before 2009, the CreditGrade model performed badly to differentiate the risk levels, considering the relatively small volume of loans sold during that period, there were even more chargedoffs for credit grade from HR to C. Even for the best rating & lowest risk “AA”, there were more than 500 charged offs.
BorrowerAPR & BorrowerRate

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
BorrowerAPR is BorrowerRate plus fees. Looking at the Borrower interest rate, the distribution is quite symetric with median equals to 0.184 & mean equals to 0.1928, except there is a spike at 0.32.
Estimated Effective Yield for Investers after July 2009

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -0.183 0.116 0.162 0.169 0.224 0.320 29084
It looks like a normal distribution except for the spikes at around 0.25 & 0.3. The median effective yield rate is 0.162, and maximum yield rate can be as high as 0.32.
What is the structure of your dataset?
The data set contains proper loan performance data from 2005 Nov to 2014 Mar. There are 113,937 observations of 81 variables, which are of class int, factor, numeric and date.
What is/are the main feature(s) of interest in your dataset?
Starting from July 2009, borrower rates are set by Prosper’s model, Prosper determines its rates using a traditional credit score and its own proprietary Prosper score based on their own historical data.
The main features of interest are how prosper determines the BorrowerRate and what would be the return to invest in Prosper as a lender as well as how safe it is.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
The customers’ profile including credit score, income, debt status, bankcard information and whehter or not in good standing with Prosper will help support my investigation into how to set the borrowerRate for them.
Also, the historical data including loan status, principal loss will help to complute the risk of investing in Prosper.
Did you create any new variables from existing variables in the dataset?
I created a new variable named CreditScore which is the average of upper and lower value.
I created a new variable named LoanStatus_update which regrouped the status of the loans.
I created a new variable named Phase with two values “Before 2009” and “After 2009” to denote the date of the entries since there was big change happend during 2009.
I created a new factorial variable named “ReturnCustomer” to group the borrower into new customers and returning customers.
While doing analysis, I found that distributions of variable values are quite different between these before July 2009 and after July 2009. So I created a variable named Phase to indicate whether it is after July 2009 or before it.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I sorted the factor variable into order, such as CreditScore & ProsperRating, sort them from highest risk to lowest risk.Sort the IncomeRange Variable from lowest to highest; Change the form of several variable into date format. Also, changed the LoanOriginationQuarter’s format from “Q%q %Y” to “%Y Q%q”. And I changed the Team variable type from num to factor. For variable TotalProsperLoans, I replaced the NA values with 0, for calculation ease.
Bivariate Plots Section
BorrowerRate vs. ProsperRating..Alpha for loans after July 2009
## Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: 6 6 3 5 2 4 7 7 4 5 ...


ProsperRating is a proprietary rating mechanism to determine the credit grade for borrowers and the rates as shown in the graph above. We can see from the plots, every line covers a wide range of credit score(for borrowers with same ProsperRating, there are huge variance in their credit score). For rating A, the credit score must be higher than 650, for AA there is a even higher minimal credit score requirement. There are sudden rate jump-ups for E & D rating with CreditScore higher than 850, which looks very abnormal, and there must be some hidden reason for this.

As you can see in the above graphic showing the borrowers’ interest rates based on their Prosper rating grades, the median interest Rate for an AA loan is below 10%, however for a HR loan is even higher than 30%. ProsperRating..Alpha is adopted by Prosper to determine the interest rate for borrowers.
BorrowerRates vs. InquiriesLast6Months

Most of the borrowers have less than 3 inquiries during last 6 months. And there seems to be a slope to show that as you have more inquiries, at a higher possibility, you will be with a higher borrower rate.Also, there is a spike at >0.3 borrower rate, with more points have >5 inquiries recently.
BorrowerRate vs. ProsperScore

ProsperScore is a custom risk score, unlike a credit bureau score, it is specifically built on the Prosper borrower population. The boxplot shows that customers assessed with lowest risk (score 11) enjoy a much lower borrower rate than those with highest risk(score 1).Median borrower rate will be lower if the score gets higher, except that between score 4 and 5, there is a fluctuation.
BorrowerRate vs. CreditScore

It is interesting to find out that before 2009, there were many borrowers with CreditScore below 600 could get loans and even some of them got very low interest rate. After the relaunch of Prosper in 2009, there is an obvious cut off of CreditScore requirement.The cut off line is close to 600, which customers must have minimal credit score of 600 to borrow through Prosper.
Mean BorrowerRate vs CreditScore for loans after July 2009

The line looks quite linear. As the CreditScore increases, the lower rate of a loan will get on average, and if the credit score reaches 850, the rate_mean remains quite flat.
Correlation between CreditScore and the mean of BorrowerRate
## [1] -0.9838036
The correlation coefficient is -0.98, which means the CreditScore and the mean BorrowerRate are strongly correlated.
BorrowerRate vs. CreditScore
## [1] -0.5090406
For data after July 2009, investigate the correlation between CredietScore and BorrowerRate. The correlation coefficient is -0.51, they are some correlation, but also there are many variation for a BorrowerRate based on the same CreditScore.
Explore what are the other variables influencing the BorrowerRate
BorrowerRate vs. Homeowner

## loan_reopen$IsBorrowerHomeowner: False
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1474 0.2015 0.2059 0.2639 0.3600
## --------------------------------------------------------
## loan_reopen$IsBorrowerHomeowner: True
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1269 0.1765 0.1871 0.2489 0.3500

By looking at the graph, it’s hard to tell whether there is any difference in the rate between homeowner & non-homeowner. I run a summary statistic for these tow groups, and find that the median borrower rate for homeowners is 2%+ lower than the rate for Non-homeowners. And from the histogram of BorrowerRate by Homeowners, the count of homeowners is almost twice of the count of non-homeowners.
BorrowerRate vs. Term
## loan_reopen$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1501 0.2064 0.2669
## --------------------------------------------------------
## loan_reopen$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1274 0.1899 0.1985 0.2699 0.3600
## --------------------------------------------------------
## loan_reopen$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1490 0.1870 0.1930 0.2319 0.3304
The table shows that for 12-month loan, the rate is much lower. And there isn’t big difference in terms of rate between 36-month and 60-month loans.
BorrowerRate vs. Delinquencies History

There is no obvious pattern between delinquencies and BorrowerRate. I will create a factor variable based the # of delinquencies and explore further in the multivariable section.
BorrowerRate vs. LoanStatus_update

## NULL
## loan$Phase: After 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1359 0.1875 0.1960 0.2574 0.3600
## --------------------------------------------------------
## loan$Phase: Before 2009
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1270 0.1700 0.1834 0.2365 0.4975
As the plot shows, the loans made between 2006 and 2008 had a very high percentage of chargedoff.After the relaunch in late 2009, the chargedoff rate became much lower. It seems Prosper has a much better model to evaluate qualifified loans and lower the risk for lenders.
BorrowerRate vs. DebtToIncomeRatio

## [1] 0.1264026
The correlation coefficient is 0.13. There is a weak correlation between the BorrowerRate vs. DebtToIncomeRatio.
BorrowerRate vs. Prosper History

From the upper plot, we see the median borrower rate gradually decreases as the total loans increase. The boxplot shows similar information. Although for borrowers with 6 and above transactions, the borrower rate is much lower, considering that there are very few customers borrowed that many times(data size is so small), it is not sufficient to conclude you will get a significant drop in rate if you borrow moren than 5 times.
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
The borrower rate is correlated with the CreditScore. To some extent, the higher the credit score is, the lower the interest rate the borrower gets. Also, the worse the credit rating one has, the higher borrower rate he or she gets. The borrowers applying for smaller amount (<10000) of loans tend to have a higher interest rate. Most of the borrowers with loan amount > 30000, regardless of their CreditScore, they obtain a rate lower than 0.2. This means larger loans tend to get better rates.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Chargedoff risk: For loans before 2009, there is a huge amount of changeoffs at a borrower rate around or higher than 0.3, more than half of the loans became chargedoff. For loans after 2009, the Chargedoff risk seems to jump up when the borrower rate is higher than 0.25. Comparing the chargedoff percentage by each quarter, the chargedoff percentage has been continuously decreasing since its relaunch in 2009. Prosper has a much better model to evaluate qualifified loans and lower the risk for lenders since its reopen.
What was the strongest relationship you found?
The strongest relationship I found is the mean of BorrowerRate at certain CreditScore vs. the CreditScore. Also, the BorrowerRate has a strong relationship with the ProsperRating, the better the rating, the lower the rate.
Multivariate Plots Section

The data points are very condensed to the leftmost, which means the majority of borrowers have no delinquency history. For lower ratings like HR and E, there are more data points falling to the right side of the plot.

I created a new variable named “Delinquency_history”, if the customer had 0 or NA delinquencies in last 7 years, the value would be “None”, if the number is below 4, the value is “Below Average”, if the number is equal to or above 4, the value is “Above Average”. From the plot, we can see that dots with above average delinquency are much more condense on the left side of the graph, which means they have lower CreditScore. There is no obvious pattern between BorrowerRate and Delinquency history.
Scatterplot Matrices
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
## [82] "CreditScore"
## [83] "Phase"
## [84] "BankCardUse"
## [85] "ReturnCustomer"
## [86] "ProsperScorelevel"
## [87] "Delinquency_history"

As shown on the scatterplot matrices, the borrower rate is correlated with credit score, the correlation coefficient is -0.516. BorrowerRate has some correlation with LoanAmount, the correlation coeffecient is -0.38.
BorrowerRate by Prosper Rating and LoanAmount

## [1] -0.4122378
## [1] 0.173795
For rating HR,E,D customers, the loan amounts are small, large loan amounts(>20,000) are shown at higher ratings. Referring to Prosper, they have constraints on how much loan one can apply for based on one’s rating. For people with good rating(C,B,A,AA), we can see many white data points are more concentrated at the bottom, which means for same rating, borrowers with small amount(<5000) of loans get a lower borrower rate. However, based on the correlation coefficient(-0.41), there are some negative correlation between LoanOriginalAmount and BorrowerRate that a larger amount tends to get a lower rate. I further check the correlation coefficient of those two variables for customers with A rating, the value is 0.17. So I think the negative coefficient for overall correlation between Prosper Rating & Loan Amount is mainly driven by the good prosper ratings of borrowers with large loans.
BorrowerRate vs. BankcardUtilization & CreditScore
### Green is very dominant on the grid except for A and AA, which means most of Prosper customers are heavy users of bankcards(>=50% of credit utilization at the time when pulled their profile).As ProsperRating upgrades from HR to AA, the quantity of mild use dots increases, also appartmently the red is more dense on the right side of the green, which means generally speaking mild users of bankcard tend to have higher credit scores.However, there is no obvious pattern between BorrowerRate and bankcard utiliztion.
Estimated Return vs Rating

Linear Model
##
## Calls:
## m1: lm(formula = I(BorrowerRate) ~ I(CreditScore), data = loan_reopen)
## m2: lm(formula = I(BorrowerRate) ~ I(CreditScore) + LoanOriginalAmount,
## data = loan_reopen)
## m3: lm(formula = I(BorrowerRate) ~ I(CreditScore) + LoanOriginalAmount +
## InquiriesLast6Months, data = loan_reopen)
##
## =========================================================
## m1 m2 m3
## ---------------------------------------------------------
## (Intercept) 0.768*** 0.708*** 0.683***
## (0.003) (0.003) (0.003)
## I(CreditScore) -0.001*** -0.001*** -0.001***
## (0.000) (0.000) (0.000)
## LoanOriginalAmount -0.000*** -0.000***
## (0.000) (0.000)
## InquiriesLast6Months 0.011***
## (0.000)
## ---------------------------------------------------------
## R-squared 0.3 0.3 0.4
## adj. R-squared 0.3 0.3 0.4
## sigma 0.1 0.1 0.1
## F 29722.5 21761.9 17187.3
## p 0.0 0.0 0.0
## Log-likelihood 112679.1 117507.0 120085.3
## Deviance 350.9 313.2 294.8
## AIC -225352.2 -235006.0 -240160.6
## BIC -225324.1 -234968.6 -240113.8
## N 84984 84984 84984
## =========================================================
I use CreditScore, LoanOriginalAmount, InquiriesLast6Months to build a linear model to predict the BorrowerRate.The R-Squared is only 0.4, so there is still a lot more unexplained variations.
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
The borrower rate is highly correlation with EstimatedLoss rate, in variable description, the EstimatedLoss is explained as “Estimated loss is the estimated principal loss on charge-offs.”, which means it represents the risk of being chargeddoff, it should be estimated through a similar model to the one used to estimate the borrow rate.Both CreditScore and Loan amount have correlations with the BorrowerRate. Loan amount has an interesting impact on the rate you can get, it seems Prosper incentivizes customers to get larger loans by providing better rates.
Were there any interesting or surprising interactions between features?
Both bankcardUse and delinquency history have interactions with credit score. People with more delinquencies and more utilizations of bankcard tend to have a lower credit score.
OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
Final Plots and Summary
Plot One

The two plots show that the loans made between 2006 and 2008 had a very high percentage of chargedoffs, which led to its shut-down in late 2008. After the relaunch in 2009, Prosper has moved away from the auction process to a new fixed rate model, in this way, Prosper could evaluate the borrowers’ qualification and decide his/her interest rate. This model proved to be very successful, as we saw the # of loans were continuously growing while the chargedoff rate was dropping since then.Prosper’s loan sales peaked in 2013 Q4, and most of loans are still current. In my opinion, it is very safe for investors, since Prosper has a solid loan approval model to lower the risk for lenders.
Plot Two

The proprietary prosper rating determines what a borrower’s interest rate will be. Although credit score has a lot to do to determine your rating, we could see ProsperRating still consider many other important factors other than credit score. Each rating grade covers a wide range of credit score. From previous exploration, I found other variables might impact your rating such as delinquency history, bankcardUse. Since the rating mechanism is owned confidentially by Prosper, there might unveiled data to determine the ProsperRating.
Plot Three

## loan_reopen$ProsperRating..Alpha.: HR
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1827 0.1135 0.1221 0.1136 0.1246 0.1399
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: E
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0124 0.1054 0.1239 0.1247 0.1487 0.1843
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: D
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0045 0.1012 0.1163 0.1187 0.1414 0.2332
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: C
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.00910 0.08227 0.09220 0.09810 0.11050 0.26670
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: B
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.00100 0.07408 0.08215 0.08629 0.09260 0.28370
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: A
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01780 0.06081 0.06663 0.06965 0.07284 0.18310
## --------------------------------------------------------
## loan_reopen$ProsperRating..Alpha.: AA
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01460 0.04554 0.05100 0.05399 0.05540 0.19360
Estimated return for a loan is calculated at the time when it is listed, which considers the risk of loss, such as being late or defaulted. For rating HR, E, D loans, although the median return is around 0.12 and the first quartile is also above 0.1, higher than other ratings, they do have longer whiskers below the lower quartile. Especially for HR rating loans, there are many outliers which have negative returns. From an investor point of view, it is really risky to invest in those loans. You may have a high possibility lossing your principal. For loans with good ratings(A,AA), the median return rates are around 0.06, only half of those HR/E loans, but there are many good outliers above which you can get >0.1 returns, and those are the gems you want to invest. As an investor, you are suggested to fund a portfolio of loans with a good mix of ratings.
Reflection
The loan industry is new to me, thus I spent a lot of time in understanding what is Prosper’s business model and what each variable stands for. Gaining adequate knowledge about the industry helps a lot in further analysis, in this case I could generate more insights instead of just merely presenting the data outcome.
It is important to do neccesary data munging first, such as to convert date numerics into date format, sort the levels of categorial variables into desired order, create levels to simply representing a continously scale. Successful data wrangling will make your analysis more efficient and effective.
During initial exploration, I found the big difference in data before 2009 and after the relaunch. The reason behind is Prosper changed their business model to provide fixed rate other than auction, also Prosper improved the model to qualify the borrowers.So I made a decision to create a new subset which only containing the data after reopen, in this way I could avoid the influences from the old Prosper model, and have better analysis results.
By conducting the exploratory data analysis, I am able to depict proper borrowers’ profiles. Also, I successfully discovered what factors could determine your rates, how safe it is to invest in prosper. Based on my findings, if you are a lender, it is very safe to invest in Prosper now, as the Chargedoff rate dropped significantly. As a borrower, to some extent, you will possibly get a better rate with a higher credit score, but there are still some other factors would impact your rate such as your recent credit inquiries, loan amout etc.
I was really struggling to come out a liner model with a good fit to predict the borrower rates. But I didn’t find other variables to be incorporated in the model to improve its performance. Also, when conducting bivariable analysis, there are many noises to influence the two variables you want to explore, so it is very critical to choose a good visualization (type of plot, color, facet_wrap etc) or find a way to compute statistics (mean, median) to describe & compare the two groups.
There are some other variables I haven’t explored yet, like the variables related to credit lines & revolving accounts. They may have important information as to predict the rates. On Prosper.com, it is saying that they use historical preformance data to make predictions, I think they deploy some machine learning techniques to make this happen. I am wondering what kind of algorithm they use. Also, I am very curious about the rejected loan applications. This data set only has listed & approved loans, how about the application turned down by prosper, what characteristics of them don’t meet prosper’s minimal requirement.